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ABSTRACT 

Nine of the most important estimators known for the two-point correlation function 
are compared using a predetermined, rigorous criterion. The indicators were extracted 
, from over 500 subsamples of the Virgo Hubble Volume simulation cluster catalog. The 

^ \ "real" correlation function was determined from the full survey in a 3000/i _1 Mpc pe- 

riodic cube. The estimators were ranked by the cumulative probability of returning a 
value within a certain tolerance of the real correlation function. This criterion takes into 
account bias and variance, and it is independent of the possibly non-Gaussian nature of 
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the error statistics. As a result for astrophysical applications a clear recommendation 
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has emerged: the Landy and Szalay (1993) estimator, in its original or grid version (Sza- 



pudi and Szalay 1998), are preferred in comparison to the other indicators examined, 
with a performance almost indistinguishable from the Hamilton (1993) estimator. 



Subject headings: methods: statistical - galaxies: clustering 
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The two-point correlation function of galaxies became one of the most popular statistical tools 
in astronomy and cosmology. If the current paradigm, where the initial Gaussian fluctuations grew 
by gravitational instability, is correct, the two-point correlation function of galaxies is directly 
related to the initial mass power spectrum. While the role of the two-point correlation function 
is central, estimators for extracting it from a set of spatial points are confusingly abundant in the 
literature. We have collected the nine most important forms used in the area of mathematics and 
astronomy. The difference between them lies mainly in their respective method of edge correction. 

The multitude of choices might appear confusing to the practicing observational astronomer. 
The reason is partly due to the lack of a clear criterion to distinguish between the estimators. 
For instance one estimator could have smaller variance under certain circumstances, but it could 
have a bias. Therefore, before doing any numerical experiments, we agreed upon the method of 
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ranking the different estimators. The cumulative probability distribution of the measured value 
lying within a certain tolerance of the "true" value is going beyond the concepts of bias or variance, 
and even takes into account any non-Gaussian behavior of the statistics. This is the mathematical 
formulation of the simple idea that an estimator which is more likely to give values closer to the 
truth is better. After the above criterion was agreed, the plan to elucidate the confusion was clear. 
Collect the different forms of estimators (next section), perform a numerical experiment in several 
subsamples of a large simulation (§3), determining the cumulative probability of measuring values 
close to the true one, thereby ranking the different estimators (§4). 

2. The estimators 

Astrophysical studies favor estimators based on counting pairs, while most of the mathematical 
research is focused on geometric edge correction (second subsection). The following subsections 
collect nine of the most successful and widespread recipes from both genres. 

2.1. Pairwise estimators 

Following Szapudi and Szalay (1998) (hereafter SS), let us define the pair-counts with a func- 
tion <E> symmetric in its arguments 

P DR (r) = Y,Y,*r(x,y). (1) 

xe-Dye-R 

The summation runs over coordinates of points in the data set D and points in the set R of 
randomly distributed points, respectively. This letter considers the two point correlation function, 
for which the appropriate definition is & r (x,y) = [r < d(x,y) < r + A], where d(x,y) is the 
separation of the two points, and [condition] equals 1 when condition holds, otherwise. Pdd and 
Prr are defined analogously, with x and y taken entirely from the data and random samples, under 
the restriction that x ^ y. Let us introduce the normalized counts DD(r) = Pdd(t")/{N(N — 
1)), DR(r) = P DR (r)/(NN R ), RR(r) = P RR (r)/{N R (N R - 1)), with N and N R being the total 
number of data and random points in the survey volume. With the above preparation the pairwise 
estimators used in what follows are the natural estimator £n 5 and the estimators due to Davis 
and Peebles (1983) f DP , Hewett (1982) £ He , Hamilton (1993) £ H a, and Landy and Szalay (1993) 
(hereafter LS) £ls: 

? _DD ? DD ? DD-DR ? _ DD RR ? DD - 2DR + RR 

~ RR ^-DR' 1 ' &c " RR ' CHa -~D^ _1 ' Cls " RR 

(2) 

Note that Hewett 's estimator could be rendered equivalent with the LS estimator if the original 
asymmetric definition of DR is symmetrized; the version we use is the one consistent with the 
notation laid out above. In the case of an angular survey, an optimal weighting scheme can be 
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adapted to any of the above estimators (e.g., Colombi et al. 1998). This is inversely proportional 
to the errors expected at a particular pair separation, essentially equivalent to the Feldman et al. 
(1994) weight. 



2.2. Geometric Estimators 

Alternative estimates of the two-point correlation function from N data points xEfl inside 
a sample window W may be written in the form 

^ (r) + 1 = N(N=Tj E E " (X ' y) - (3) 

|W| is the volume of the sample window and the sum is restricted to pairs of different points x^y. 
For a suitably chosen weight function cj(x, y) these edge corrected estimators are approximately 
unbiased. Such weights are the Ripley (1976) - Rivolo (1986) weight cjr, the Ohser and Stoyan 
(1981) - Fiksel (1988) weight oj f , and the Ohser (1983) weight oj q . 

4irr 2 |W| \W\ 

WR(x,y) = , nW , , wp(x,y) = r, w (x,y) = = - (4) 

area(<9£ r (x) n W) 7w(x - y) 7w(|x-y|) 

where area(<9,B r (x)n>V) is the fraction of the surface area of the sphere £> r (x) with radius r = |x— y| 
around x inside W, the set-covariance 7w( z ) = | W n W z | is the volume of the intersection of the 
original sample W with the set W z shifted by z, and yw(r) is the isotropized set-covariance. We 
will consider the estimators £r, £f> an d based on these weights. A detailed description of these 
estimators may be found in Stoyan et al. (1995) and Kerscher (1999). The Minus or reduced 
sample estimator, employing no weighting scheme at all, may be obtained by looking only at the 

N (r) 

points which are further than r from the boundaries of W: 

^ M(r) + 1 = lY-iv>) E Ll^A ( 5 ) 

xeiX 1 -) yeD 

Estimators of this type are used by Sylos Labini et al. (1998). 

It can be shown that the natural estimator £n is the Monte-Carlo version of the Ohser estimator 
£o- Similarly, the geometric counterparts of the LS and Hamilton estimator may be constructed 
(Kerscher 1999). This allowed us to cross-check our programs. Focusing on improved number 
density estimation, Stoyan and Stoyan (2000) arrived also at the geometrical version of the Hamilton 
estimator. 



3. The comparison 



To compare these estimators for typical cosmological situations we use the cluster catalogue 
generated from the ACDM Hubble-volume simulation (Colberg et al. 1998). In order to investigate 
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the effects of shape, clustering, and the amount of random data used, we have always varied one 
parameter at a time, starting from a fiducial sample. Rectangular subsamples were extracted 
exhausting the full simulation cube: the fiducial cubic subsamples C, slices S, pencil beams P, and 
cubic samples with cutout holes H, all with approximately the same volume and with approximately 
430 clusters each. Cutout holes around bright stars, etc. arise naturally in all realistic surveys. 
The pattern of holes used for this study was directly mapped from a 19° x 19° patch of the EDSGC 
survey to one of the faces of the simulation sub-cube. Then the holes were continued across the 
subsample, parallel with the sides corresponding to a distant observer approximation. The physical 
size of the holes roughly corresponded to a redshift survey at a depth of about 300/t -1 Mpc. All 
these point sets are "fully sampled" and may be considered as volume limited samples. 

In addition, Poisson samples N, i.e. without clustering, were generated. All calculations em- 
ployed Nr = 100k random points for the pairwise estimators, unless otherwise noted. This was 
sufficient for all indicators to converge. The calculations were repeated for sample C with Nr = lk 
and Nr = 10k random points, denoted with Rl and RIO, respectively, to investigate the speed of 
convergence of the different estimators with respect to the random point density. The parameters 
for the samples are summarized in Table 1. 

The two-point correlation function £ per extracted from all clusters inside the 3/i _1 Gpc cube 
provided our reference or "true value" . Since the simulation was carried out with periodic boundary 
conditions, the cluster distribution is also periodic, therefore the torus boundary correction is exact 
(Ripley 1988). 

The nine estimators defined above for the two-point correlation function were determined from 
each of the n$ subsamples. For a given radial bin r we computed the deviation |£*(r) — £ pe r(f)| 
of the estimated two-point correlation function £*(r) from the reference £ per (0- The empirical 
distribution of these deviations provides an objective basis for the comparison of the utility of 
the estimators. The large number of samples enabled the numerical estimation of the probability 
— £pcr| < d) that the deviation |£* — £ per | is smaller than a tolerance d. The larger this 
probability, the more likely that the estimator will be within the predetermined range from one 
sample. In general it could happen that the rank of two estimators reverses as the tolerance varies, 
but as will be shown in the next section, this is quite atypical. This procedure is more general than 
considering only bias and variance, which yields only a full description if the above distribution is 
the integral of a Gaussian. Note that a small bias is negligible for practical purposes if the variance 
dominates the distribution of the deviations. It is worthwhile to note that Gaussian assumption 
yields a surprisingly good description of the deviations |£*(r) — £ pe r(?")|. For estimators of the closely 
related product density, asymptotic Gaussianity of the deviations was proven by Heinrich (1988). 

Fig. 1 shows the distribution P(|£*(r) — £ pe r(OI < d) for the samples described in Table 1. 
Three typical scales are displayed to illustrate the general behavior. The principal conclusions to 
be drawn are the following: 
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Small scales (r = 4.4/i _1 Mpc y ): the effect of any boundary correction scheme becomes negli- 
gible, and as expected, all the estimators exhibit nearly identical behavior. The same is true for 
the samples S, P, N, and H not shown. However, some of the estimators are more sensitive to 
the density of random points, especially the Hamilton estimator, followed by the Davis-Peebles 
estimator. They show stronger deviations for the Rl and RIO samples due to the poor sampling 
of the DR term (see also Pons-Borderfa et al. 1999). This effect persists on large scales as well. 

Intermediate scales (r = 31h~ 1 Mpc): similar to the small scales. The Minus estimator shows 
stronger deviation becoming even more pronounced for the S, P, and H samples, since the effective 
remaining volume decreases. 

Large scales (r = 115/i _1 Mpc,): edge corrections are becoming important, and the estimators 
exhibit clear differences in their distributions of the deviations for the samples C, and N. For a given 
probability, the Minus estimator shows the largest deviations, followed by the Natural, Fiksel and 
Ohser estimators. Significantly smaller deviations are obtained for the Davis-Peebles and Hewett 
estimators, and yet smaller for the Rivolo estimator. Finally, the Hamilton and LS estimator 
display the smallest deviations and thus the best edge correction. The two latter distributions 
nearly overlap. The above conclusions are robust and only weakly influenced by the presence of 
cutout holes, as seen from the H sample. The geometry of the subsamples has a non-trivial effect 
on the distributions. While the deviations are increased in the S and P samples, the differences 
between the estimators are reduced in the S, becoming negligible in the P samples. In both cases, 
the Minus estimator is not usable any more, since the equal zero, whereas the Fiksel estimator 
is biased for such geometries on large scales (this is implicitly shown in the work of Ohser (1983)). 

Following Szapudi and Szalay (1999) the variance of the LS estimator may be calculated for a 
Poisson process: 

^ s(r) = v£w 2 ' (6) 

with V A (r) = j w d 3 xj w d 3 y $ r (x,y). The (Jls calculated for the considered samples is also 
shown in Fig. 1 to illustrate how much discreteness effects contribute to the distribution of the 
deviations. For our choice of sample parameters, the discreteness contribution, i.e. the deviation of 
a corresponding Poisson sample, is always within a factor of few of other important contributions 
to the variance, such as finite volume and edge effects. In general, the ratio of discreteness effects 
to the full variance depends in a complicated non-linear fashion on the number of clusters in the 
sample, the shape of the survey, integrals over the two-point correlation function and its square, and 
the three- and four-point correlation functions (see Szapudi et al. 2000 for the exact calculation). 
Varying the side length of the cubic samples from 300/i _1 Mpc, 375/i~ 1 Mpc, 600/i _1 Mpc to l/i _1 Gpc 
we explored the influence of the size of the sample on the discreteness effects. Still, the rank order 
of the estimators stayed invariant. 
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4. Summary and Conclusion 

For a sample with 222,052 clusters extracted from the Virgo Hubble volume simulation a 
reference two-point correlation function was determined. Within over 500 subsamples several 
estimators for the two-point correlation function were employed, and the results compared with 
the reference value. On small scales all the estimators are comparable. On large scales the LS 
and the Hamilton estimator significantly outperform the rest, showing the smallest deviations for 
a given cumulative probability. While the two estimators yield almost identical results for infinite 
number of random points, the Hamilton estimator is considerably more sensitive to the number of 
random points employed than the LS version. From a practical point of view the LS estimator is 
thus preferable. The rest of the estimators can be divided into three categories: The first runner- 
ups are the estimators from Rivolo, Davis-Peebles and Hewett, but already with a significantly 
increased variance. Even larger deviations are present for the Natural, Fiksel, and Ohser recipes. 
The Minus estimator has the largest deviation. Although it was shown that for special point 
processes both the LS and Hamilton estimator are biased (Kerscher 1999), the present numerical 
experiment demonstrates that this is irrelevant for the realistic galaxy and cluster point processes, as 
the bias has an insignificant effect on the distribution of deviations. Pons-Borderfa et al. (1999) did 
not recommend one estimator for all cases. In contrast, through our extensive numerical treatment 
the LS estimator emerges as a clear recommendation. 

The above considerations apply to volume limited samples. When the correlation function is 
estimated directly from a flux limited sample with an appropriate minimum variance pair weighting 
(Feldman et al. (1994)), the Hamilton estimator has the advantage of being independent of the the 
normalization of the selection function. 

The differences between the estimators become smaller for the slice and insignificant for the 
pencil beam samples. At first sight this is counter intuitive: the difference between the estimators 
is largely due to edge corrections, and less compact surveys obviously have more edges. For the 
large scale considered, S and P become essentially two- and one-dimensional and the weight 
oj(x, y) « U(r) is equal for most of the pairs separated by r. Since the geometric estimators are 
approximately unbiased they employ mainly the same weight TD{r) on large scales, and consequently 
show the same distribution of deviations. This argument also applies to the pair estimators, since 
they may be written in terms of these weights (Kerscher 1999). 

All the above numerical investigations are intimately related to the problem of calculating 
the expected errors on estimators for correlation functions. To include all contributions, such as 
edge, discreteness, and finite volume effects, the method by Colombi et al. (1994), Szapudi and 
Colombi (1996), Colombi et al. (1998), Szapudi et al. (1999) has to be extended for the two point 
correlation function. Such a calculation was performed by Szapudi et al. (2000), (see also Stoyan 
et al. 1993, Bernstein 1994 and Hamilton 1993 for approximations) and should be used for ab initio 
error calculations. 

It is worth to mention, that one of the most widely used method in the literature, "bootstrap" , 
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is based on a misunderstanding of the concept. For bootstrap in spatial statistics, a whole sample 
takes the role of one point in the original bootstrap procedure. This means that replicas of the 
original surveys would be needed to fulfill the promise of the bootstrap method. Choosing points 
(i.e. individual galaxies, clusters, etc.) randomly from one sample, as usually done, yields a variance 
with no obvious relation to the variance sought (see also Snethlage 1999). 

The role of the random samples is to represent the shape of the survey in a Monte-Carlo 
fashion. A practical alternative is to put a fine grid on the survey and calculate the quantities 
DD, DW, and WW, where D now represents bin-counts and W the indicator function taking the 
value one for pixels inside the survey, and zero otherwise. According to SS, all the above estimators 
have an analogous "grid" version (see also Hamilton 1993) which can be obtained formally by the 
substitution R — > W . In practice grid estimators can be more efficient than pair counts, and except 
for a slight perturbation of the pair separation bins, they both yield almost identical results for 
scales larger than a few pixel size. 

The usual way of estimating the power spectrum, using a folding with the Fourier transform 
of the sample geometry is equivalent to the grid version of the LS estimator. Hence, such a power 
spectrum analysis extracts the same amount of information from the data as the analysis with the 
two-point correlation function using the grid version of the LS estimator. The results are only 
displayed with respect to a different basis. Similarly, Karhunen-Loeve (KL) modes form another 
set of basis functions (Vogeley and Szalay 1996). The uncorrelated power spectrum (Hamilton 
2000) and the KL modes are the methods of choice for cosmological parameter estimation. The KL 
modes allow for a well-defined cut-off, and therefore reduce the computational needs in a maximum 
likelihood analysis. However, geometrical features of the galaxy and cluster distribution directly 
show up in two-point correlation function and may be interpreted easily. Each bin of the two- 
point correlation function contains direct information on pairs separated by a certain distance, an 
intuitively simple concept more suitable to study and control (expected or unexpected) systematics 
(geometry, luminosity, galaxy properties, biases) than any other representation. In this sense the 
correlation function is a tool complementary to the power spectrum. 
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Table 1: The description of the samples. L xy , and L z are the side-lengths of the rectangular 
samples, ns is the number of samples exhausting the periodic cube with 222,052 clusters in total, 
N the mean number of clusters inside the samples, and Nr the number of random points used. 
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Fig. 1. — The cumulative probability distribution P(|£*(r) — £ P er(?")l < d) of the deviations d are 
shown for several samples and radii r. In the plots for the samples C, S, P, H, N we use the 
following symbols for the deviations of the estimators: Natural (long dashed), DP (dotted-dashed) , 
Hewett (short dashed), Hamilton (dotted), LS (solid), Rivolo (solid, thick), Fiksel (dotted, thick), 
Ohser (short dashed, thick), and Minus (long dashed, thick). (On large scales neither the Minus nor 
the Fiksel are applicable in the samples S and P and consequently no results for them are shown.) 
The horizontal lines starting at 0.683 marks the value of ctls fo r a Poisson process according to 
Eq. (6) in the geometry considered. In the plots for Rl and RIO the solid line marks the result for 
the LS estimator using 100k points. The estimators using lk and 10k random points respectively 
are marked in the following way: DP (dotted), Hamilton (short dashed), LS (long dashed). For 
better visibility we lay a Gaussian distance distribution (dashed-dotted) over the result for the LS 
estimator only in the plot for Rl. A similar perfect agreement would be obtained in the other 
cases. 



